This notebook demonstrates an example natural language processing & machine learning project using Python and the "PyData" ecosystem. The problem we're trying to solve is estimating a restaurant's overall "star" rating on Yelp using only the text that customers have written about the restaurant in their reviews.
Yelp is a business listings and review service available in a number of countries around the world. Yelp is an attractive subject for a machine learning demo because the company publishes a meaningful chunk of its overall dataset for educational use. The dataset is available for download in JSON format on the Yelp Dataset Challenge page.
For this notebook to work as written, you'll need to download the Yelp Dataset from the Yelp Dataset Challenge page and place the yelp_dataset_challenge_academic_dataset directory in the notebook's project folder.
As you work through the notebook, you are encouraged to run the code live to see it in action. Feel free to edit, re-run, and experiment with the code yourself!
This first cell imports some libraries that we'll use later. Extensive documentation on each of these libraries and what they do is readily available online.
import os
import json
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVR, LinearSVC
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource
output_notebook()
This cell assigns the location of the Yelp Dataset files to the DataDirectory variable.
DataDirectory = os.path.join(os.getcwd(), 'yelp_dataset_challenge_academic_dataset')
DataDirectory
Here, I'm reading in the business data file and creating a Pandas DataFrame named businesses out of it.
BusinessesFilePath = os.path.join(DataDirectory, 'yelp_academic_dataset_business.json')
with open(BusinessesFilePath, 'r') as f:
    BusinessData = f.readlines()
BusinessData = [line.rstrip() for line in BusinessData]
BusinessJsonStr = "[" + ','.join(BusinessData) + "]"
businesses = pd.read_json(BusinessJsonStr)
businesses.info()
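As an aside, newer versions of pandas (0.19+) can read newline-delimited JSON like this directly, without the manual join, via the lines=True argument to read_json. A minimal sketch using an in-memory stand-in for the Yelp file:

```python
import io

import pandas as pd

# Two fake records in newline-delimited JSON, mimicking the Yelp file layout
json_lines = '{"business_id": "a1", "stars": 4.0}\n{"business_id": "b2", "stars": 3.5}'

# lines=True treats each line as one JSON record
df = pd.read_json(io.StringIO(json_lines), lines=True)
print(df.shape)  # (2, 2)
```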
61,184 businesses is a lot of data for a simple demo, so let's filter things down. I lived in Pittsburgh for a long time, so let's filter our businesses DataFrame down to restaurants in the city of Pittsburgh, and store the result in a new DataFrame named PittsburghBusinesses.
PittsburghBusinesses = businesses[businesses['city'] == 'Pittsburgh'].copy()
PittsburghBusinesses = PittsburghBusinesses[PittsburghBusinesses['categories'].apply(lambda categories: 'Restaurants' in categories)]
PittsburghBusinesses = PittsburghBusinesses.set_index('business_id')
PittsburghBusinesses.shape
Now we have 1,217 business listings in our dataset. In the next cell, I create a "frozenset" called BusinessSet containing the business_id values associated with each business. "frozenset" objects support fast, constant-time membership lookups.
BusinessSet = frozenset(PittsburghBusinesses.index)
len(BusinessSet)
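To illustrate the point about fast lookups: membership tests against a frozenset take constant time on average, unlike a linear scan through a list, and the frozenset itself is immutable. A tiny sketch with made-up IDs:

```python
# Made-up business IDs standing in for the real ones
ids = frozenset(['abc123', 'def456', 'ghi789'])

print('abc123' in ids)  # True
print('zzz000' in ids)  # False

# Unlike a plain set, a frozenset is hashable, so it can even be a dict key
lookup = {ids: 'Pittsburgh restaurants'}
```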
Next, let's read in the customer reviews. There are ~1.6M reviews in the overall dataset. We only need the reviews associated with our set of restaurants in Pittsburgh. Let's read in the reviews from the dataset file one at a time, and if they're associated with one of the business_ids we have stored in BusinessSet, let's add them to a list called PittsburghReviews for further processing.
This cell may take a minute or two to run.
ReviewsFilePath = os.path.join(DataDirectory, 'yelp_academic_dataset_review.json')
PittsburghReviews = []
with open(ReviewsFilePath, 'r') as ReviewsFile:
    for line in ReviewsFile:
        ThisReview = json.loads(line)
        if ThisReview['business_id'] in BusinessSet:
            PittsburghReviews.append(ThisReview)
len(PittsburghReviews)
Looks like we have 44,126 customer reviews associated with our 1,217 restaurants.
Let's display the first customer review in the list to see how the data is formatted.
PittsburghReviews[0]
Let's turn our list of customer reviews into a Pandas DataFrame, named PittsburghReviewsDf, which will make working with the data a little easier.
PittsburghReviewsDf = pd.DataFrame(PittsburghReviews)
PittsburghReviewsDf.info()
Next, we need to connect data from the customer reviews with the restaurants that the reviews are about. Here, we iterate through the rows of the PittsburghBusinesses DataFrame and add three new columns derived from the customer reviews: ReviewCount (the number of reviews for the business), CalculatedAverageStars (the average star rating computed from those reviews), and TextFromReviews (the text of all the reviews joined together).
This cell may take a minute or two to run.
PittsburghBusinesses['ReviewCount'] = 0
PittsburghBusinesses['CalculatedAverageStars'] = 0.0
PittsburghBusinesses['TextFromReviews'] = ' '
for business_id, business_data in PittsburghBusinesses.iterrows():
    ReviewsThisBusiness = PittsburghReviewsDf[PittsburghReviewsDf['business_id'] == business_id]
    CountOfReviews = len(ReviewsThisBusiness)
    CalculatedAverageStars = ReviewsThisBusiness['stars'].mean()
    TextFromReviews = ' '.join(ReviewsThisBusiness['text'].tolist())
    PittsburghBusinesses.loc[business_id, 'ReviewCount':'TextFromReviews'] = [CountOfReviews,
                                                                              CalculatedAverageStars,
                                                                              TextFromReviews]
PittsburghBusinesses.info()
Let's take a look at a preview of our PittsburghBusinesses DataFrame, including the new columns we've added for each business.
PittsburghBusinesses.sample(3)
We're now ready to start the natural language processing & machine learning data pipeline. The first step is to split the dataset into a "training set", which is data that will be used to train the model, and a "testing set", which is data that is held out from the training process and can be used to evaluate the model's performance.
We'll use the first 900 restaurants in the dataset for training (TrainingData) and the remaining 317 (1,217 minus 900) for testing (TestingData).
TrainingData = PittsburghBusinesses.iloc[0:900]
TestingData = PittsburghBusinesses.iloc[900:]
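This positional split is simple but depends on the DataFrame's row order. If you'd rather have a shuffled split, scikit-learn's train_test_split does the same job; here's a sketch on a tiny toy DataFrame (the column values are made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({'TextFromReviews': ['good food', 'slow service', 'great pierogies', 'just ok'],
                    'CalculatedAverageStars': [4.5, 2.0, 5.0, 3.0]})

# 75% training / 25% testing, shuffled with a fixed seed for reproducibility
train, test = train_test_split(toy, test_size=0.25, random_state=42)
print(len(train), len(test))  # 3 1
```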
Next, we'll create CountVectorizer and TfidfTransformer objects, which are common starting points to use natural language text as inputs for machine learning. The CountVectorizer counts the occurrence of all individual words and phrases in every review, while the TfidfTransformer re-weights the word and phrase counts so that very common words and phrases tend to receive lower weight, while relatively rare words and phrases tend to receive a higher weight. We'll "fit" the CountVectorizer and TfidfTransformer objects to the review text associated with businesses in the training set.
We can use the CountVectorizer and TfidfTransformer implementations from the scikit-learn library. Documentation for each can be found on the scikit-learn website.
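To make the two-step transformation concrete, here's the same pipeline run on a tiny toy corpus (with default parameters rather than the min_df and ngram_range settings used in the next cell):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ['the pierogies were great',
          'the service was slow',
          'great pierogies and great service']

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)         # sparse matrix of raw term counts
tfidf = TfidfTransformer().fit_transform(counts)  # counts re-weighted by rarity

# One row per document, one column per distinct term
print(counts.shape[0])  # 3
print(sorted(vectorizer.vocabulary_))
```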
This cell will take a minute or two to run.
MyCountVectorizer = CountVectorizer(min_df=0.03, ngram_range=(1,3))
MyTfidfTransformer = TfidfTransformer()
TrainingWordCountMatrix = MyCountVectorizer.fit_transform(TrainingData['TextFromReviews'].tolist())
TrainingTfidfMatrix = MyTfidfTransformer.fit_transform(TrainingWordCountMatrix)
TrainingTfidfMatrix.shape
It's now time to train a machine learning model. As data science practitioners, we know a couple of things about this problem that will assist with selecting an appropriate algorithm: the target variable (a restaurant's average star rating) is continuous, making this a regression problem, and our TF-IDF features are high-dimensional and sparse. Linear support vector machine models are known to work well for high-dimensional natural language problems, so let's use a linear support vector machine algorithm intended for regression problems. We can use the LinearSVR implementation from the scikit-learn library.
Let's create a model, YelpReviewModel, and train ("fit") it using the training data. Our predictive features will be the TfidfMatrix produced by our CountVectorizer/TfidfTransformer, and the average star rating will be the target variable to estimate.
YelpReviewModel = LinearSVR()
YelpReviewModel.fit(TrainingTfidfMatrix, TrainingData['CalculatedAverageStars'])
Now our model is trained! Let's take a minute to do some manual introspection into our model to try to understand it a bit better. What words and phrases were assigned a high positive weight by the model? In other words, people using them in reviews suggests a high restaurant rating...
weights = YelpReviewModel.coef_
# order the vocabulary terms by their column index so they align with the model's coefficients
terms = sorted(MyCountVectorizer.vocabulary_, key=MyCountVectorizer.vocabulary_.get)
TermWeights = pd.Series(weights, index=terms).sort_values(ascending=False)
TermWeights.head(10)
Now the words and phrases with very negative weights -- in other words, people using them in reviews suggests a low restaurant rating:
TermWeights.tail(10)
Based on our initial inspection, it looks like our model has detected some really broad, meaningful patterns mapping the words and phrases customers use with the star ratings the restaurants receive!
Next, we need to evaluate the performance of our trained model in a more rigorous and mathematical way. Let's use our CountVectorizer and TfidfTransformer objects on the text from customer reviews for the restaurants in TestingData to transform it into a format that can be used by our LinearSVR model.
TestingWordCountMatrix = MyCountVectorizer.transform(TestingData['TextFromReviews'])
TestingTfidfMatrix = MyTfidfTransformer.transform(TestingWordCountMatrix)
TestingWordCountMatrix.shape
We're ready to make our predictions on the testing data! Let's apply our YelpReviewModel to the transformed review text for each restaurant in the testing data to make a raw prediction (predict). We'll then clip the predictions to ensure that they're constrained within a range of [0, 5], which is the valid range for star ratings on Yelp. Next, we'll calculate the error terms -- how far off our model's prediction was from the true average star rating, as calculated from the reviews.
Finally, we'll also calculate the overall average star rating for all restaurants in our training data as another type of "prediction". In the absence of our model, this average star rating would be a reasonable "prediction" we'd have available for the star rating of any individual restaurant. It's useful as a baseline with which to compare our model's performance.
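The clipping step uses NumPy's clip, which bounds an array's values elementwise, so any raw prediction above 5 or below 0 gets pulled back into the valid range:

```python
import numpy as np

# Made-up raw model outputs, two of them outside the valid [0, 5] star range
raw_predictions = np.array([5.7, 4.2, -0.3, 3.9])
clipped = np.clip(raw_predictions, 0, 5)
print(clipped.tolist())  # [5.0, 4.2, 0.0, 3.9]
```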
You may see some SettingWithCopyWarning messages come from this cell -- that's OK.
TestingData['PredictedStars'] = np.clip(YelpReviewModel.predict(TestingTfidfMatrix), 0, 5)
TestingData['Error'] = TestingData['PredictedStars'] - TestingData['CalculatedAverageStars']
TestingData['Baseline-OverallAverage'] = TrainingData['CalculatedAverageStars'].mean()
Let's preview the testing results.
TestingData[['ReviewCount', 'CalculatedAverageStars', 'PredictedStars', 'Error']].sample(10)
In general, the true and predicted values look pretty close! Can we describe the accuracy of the model in more mathematical terms?
Let's calculate the R-squared value for the model's predictions using the built-in .score() method.
r2 = YelpReviewModel.score(TestingTfidfMatrix, TestingData['CalculatedAverageStars'])
print('{:.1%}'.format(r2))
A majority of the variability in restaurants' average star ratings can be explained by our model!
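For scikit-learn regressors, .score() returns the coefficient of determination, R² = 1 - SS_res/SS_tot, the same value that sklearn.metrics.r2_score computes. A toy check with made-up ratings:

```python
import numpy as np
from sklearn.metrics import r2_score

true_stars = np.array([3.0, 4.0, 5.0, 2.0])
predicted = np.array([3.2, 3.9, 4.8, 2.5])

# R^2 = 1 - (sum of squared residuals) / (total sum of squares)
ss_res = np.sum((true_stars - predicted) ** 2)
ss_tot = np.sum((true_stars - true_stars.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(np.isclose(manual_r2, r2_score(true_stars, predicted)))  # True
```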
Below, I've written a quick custom function, calculateRmse(), to calculate "Root Mean Squared Error" (RMSE), which is another common performance metric for regression problems. Lower RMSE values mean better predictions.
We'll calculate both the RMSE for our model's predictions and the baseline RMSE (using the overall average rating) for comparison.
def calculateRmse(TrueValues, PredictedValues):
    Errors = TrueValues - PredictedValues
    SquaredErrors = np.power(Errors, 2)
    MeanSquaredError = SquaredErrors.mean()
    RootMeanSquaredError = np.sqrt(MeanSquaredError)
    return RootMeanSquaredError
ModelRmse = calculateRmse(TestingData['CalculatedAverageStars'], TestingData['PredictedStars'])
BaselineRmse = calculateRmse(TestingData['CalculatedAverageStars'], TestingData['Baseline-OverallAverage'])
print('Model RMSE: {:.2}'.format(ModelRmse))
print('Baseline RMSE: {:.2}'.format(BaselineRmse))
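The improvement quoted below is the relative reduction in RMSE, 1 - ModelRmse/BaselineRmse. With placeholder values standing in for the notebook's actual results:

```python
# Placeholder RMSE values for illustration only; the real numbers come from the cell above
model_rmse = 0.31
baseline_rmse = 0.50

improvement = 1 - model_rmse / baseline_rmse
print('{:.0%}'.format(improvement))  # 38%
```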
According to the RMSE metric, the model's predictions improve on the baseline by 38%.
Next, we'll visualize the results to get a better understanding of what's going on. Below, I've used the bokeh visualization library to create a scatterplot. Each dot on the scatterplot is a restaurant in our testing set. Restaurants' true star ratings are measured on the X axis and our model's predicted star ratings are on the Y axis. The size of each dot indicates the number of reviews that the restaurant received.
The green line at a 45 degree angle indicates perfect performance, where the predicted rating exactly equals the true rating. The horizontal red line shows the average star rating across all restaurants in our training dataset, which, without our model, would be the best "predicted stars" assumption we'd have available for any individual restaurant.
Note that bokeh's built-in interactivity allows you to explore the data by panning, zooming, resizing, and selecting data.
source = ColumnDataSource(TestingData)
tools = 'pan,lasso_select,box_zoom,resize,wheel_zoom,reset,save'
plot1 = figure(tools=tools)
plot1.line([0, 5.25],
           [0, 5.25],
           line_color='green',
           line_width=2,
           line_alpha=0.5,
           legend='Perfect Performance')
plot1.line([0, 5.25],
           [TrainingData['CalculatedAverageStars'].mean(), TrainingData['CalculatedAverageStars'].mean()],
           line_color='red',
           line_alpha=0.5,
           legend='Average Restaurant Rating')
plot1.scatter('CalculatedAverageStars',
              'PredictedStars',
              size=np.sqrt(TestingData['ReviewCount']),
              alpha=0.6,
              line_color=None,
              source=source,
              legend='Restaurants')
plot1.xaxis.axis_label = 'True Stars'
plot1.yaxis.axis_label = 'Predicted Stars'
plot1.legend.orientation = 'bottom_right'
show(plot1)
This visualization tells us a lot about the dataset in general and about how well the model predicts restaurants' true star ratings. For example, the predictions track the true ratings fairly closely for most restaurants, and restaurants with more reviews (the larger dots) tend to sit closer to the perfect-performance line, while restaurants with few reviews scatter more widely.
We can explore the relationship between review count and prediction error more deeply with another visualization. I've created another scatterplot, again with each restaurant in the testing set represented as a dot. This time, the X axis shows the number of customer reviews each restaurant received and the Y axis shows the model's prediction error. The dot colors indicate the relative true star rating that the restaurants received, with the lowest-rated restaurants colored blue ("cold"), the highest-rated restaurants colored red ("hot"), and middling star ratings somewhere in between. A horizontal green line at 0 prediction error indicates perfect performance.
# this code cell maps colors from a (r, g, b, a) representation to a #xxxxxx hex representation.
seismic = plt.get_cmap('seismic')
MinStar = TestingData['CalculatedAverageStars'].min()
MaxStar = TestingData['CalculatedAverageStars'].max()
StarRange = MaxStar - MinStar
ScaledStarRating = (TestingData['CalculatedAverageStars'] - MinStar) / StarRange
def convert_to_hex(rgba_color):
    red = int(rgba_color[0] * 255)
    green = int(rgba_color[1] * 255)
    blue = int(rgba_color[2] * 255)
    return '#{:02x}{:02x}{:02x}'.format(red, green, blue)
DotColors = [convert_to_hex(x) for x in seismic(ScaledStarRating)]
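The scaling-and-color-mapping idea can be checked on boundary values. Note that each 0-1 color channel must be scaled by 255, the maximum value of an 8-bit channel (rgba_to_hex below is a standalone helper written for this sketch):

```python
import matplotlib.pyplot as plt

def rgba_to_hex(rgba_color):
    # scale each 0-1 channel to 0-255 and render it as two hex digits
    return '#{:02x}{:02x}{:02x}'.format(*(int(round(channel * 255)) for channel in rgba_color[:3]))

seismic = plt.get_cmap('seismic')
print(rgba_to_hex((1.0, 1.0, 1.0, 1.0)))  # #ffffff
print(rgba_to_hex(seismic(0.0)))  # the "cold" (blue) end of the colormap
print(rgba_to_hex(seismic(1.0)))  # the "hot" (red) end
```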
plot2 = figure(tools=tools)
plot2.line([-10, 300],
           [0, 0],
           line_color='green',
           line_alpha=0.5,
           line_width=2.,
           legend='Perfect Prediction')
plot2.scatter('ReviewCount',
              'Error',
              fill_color=DotColors,
              fill_alpha=0.8,
              line_color=None,
              source=source,
              legend='Restaurants')
plot2.xaxis.axis_label = 'Count of Reviews'
plot2.yaxis.axis_label = 'Prediction Error'
show(plot2)
With this view, we can confirm some of our suspicions from before: the largest prediction errors belong to restaurants with few reviews, and the errors shrink as the review count grows.
Note that because the two plots reference the same DataFrame (TestingData) as their source, the plots are linked. If you select data in one plot, the corresponding data will also be highlighted in the other, enabling further insights by using both views of the data together.
At this point, there are a number of things we could do to try to further analyze our model or improve its performance. We could go back and expand the dataset, so we're looking at more restaurants, perhaps including other cities. We could also try to include more predictive features, such as the style of food the restaurant serves, or what neighborhood the restaurants are located in. We could also try tweaking back-end configuration variables in the NLP/machine learning pipeline itself (known as "hyperparameter tuning") to see what impact that might have on the model's predictions.
On the other hand, if this model and its performance yield meaningful value as is, it could be time to deploy the model in a production setting!
This demo has focused on restaurant and customer review data from Yelp, but you can see how the concepts could generalize to any number of other topics and datasets.
Happy learning!